A common approach to determine the cost of products is the should cost method. It consists in estimating what a product should cost based on materials, labor, overhead, and profit margin. Although this strategy is very accurate, it has the drawback of being tedious and it requires expert knowledge of industrial technologies and processes. To get a quick estimation, it is possible to build a statistical model to predict the price of products given their characteristics. With such a model, it would no longer be necessary to be an expert or to wait several days to assess the impact of a design modification, a change in supplier or a change in production site. Before builing a model, it is important to explore the data which is the aim of this case study. This study was commissioned by a cosmetics company that wants to estimate the price of Screw Caps of shampoo bottles.
Let’s first load the database study it’s structure and load the différent packages.
#Loading the different packages for this study
library(dplyr)
library(readr)
library(ggplot2)
library(FactoMineR)
library(cluster)
library(fpc)
library(factoextra)
library(FactoInvestigate)
library(plotly)
Screw Caps Dataset
Now, we load the dataset used for this study:
#Loading the dataset
dataset <- read.table("ScrewCaps.csv", header = TRUE, sep = ",", dec = ".", row.names = 1)
#Printing the dataset
head(dataset)
#Understanding the structure
print(paste0("DB Dimensions: ", dim(dataset)[1]," X " , dim(dataset)[2] ))
[1] "DB Dimensions: 195 X 11"
summary(dataset)
Supplier Diameter weight nb.of.pieces Shape Impermeability
Supplier A: 31 Min. :0.4458 Min. :0.610 Min. : 2.000 Shape 1:134 Type 1:172
Supplier B:150 1st Qu.:0.7785 1st Qu.:1.083 1st Qu.: 3.000 Shape 2: 45 Type 2: 23
Supplier C: 14 Median :1.0120 Median :1.400 Median : 4.000 Shape 3: 8
Mean :1.2843 Mean :1.701 Mean : 4.113 Shape 4: 8
3rd Qu.:1.2886 3rd Qu.:1.704 3rd Qu.: 5.000
Max. :5.3950 Max. :7.112 Max. :10.000
Finishing Mature.Volume Raw.Material Price Length
Hot Printing: 62 Min. : 1000 ABS: 21 Min. : 6.477 Min. : 3.369
Lacquering :133 1st Qu.: 15000 PP :148 1st Qu.:11.807 1st Qu.: 6.161
Median : 45000 PS : 26 Median :14.384 Median : 8.086
Mean : 96930 Mean :16.444 Mean :10.247
3rd Qu.:115000 3rd Qu.:18.902 3rd Qu.:10.340
Max. :800000 Max. :46.610 Max. :43.359
The data ScrewCap.csv contains 195 lots of screw caps described by 11 variables. Diameter, weight, length are the physical characteristics of the cap; nb.of.pieces corresponds to the number of elements of the cap (the picture above corresponds to a cap with 2 pieces: the valve (clapet) is made of a different material); Mature.volume corresponds to the number of caps ordered and bought by the compagny (number in the lot). All the categorical features are Factors. The other features are numerical.
Univariate and bivariate descriptive statistics
Price distribution
d <- density(dataset$Price)
#Plotting the histogram
hist(dataset$Price, breaks=40, probability = TRUE, main = "Price distribution",
xlab = "Price")
#Plotting the density
lines(d, col = "red")
We have here a bimodal distribution and we can describe it in more details with the quantiles:
p <- plot_ly(type = 'box') %>% add_boxplot(y = dataset$Price, jitter = 0.3, pointpos = -1.8, boxpoints = 'all',
marker = list(color = 'rgb(7,40,89)'),
line = list(color = 'rgb(7,40,89)'),
name = "All Points") %>% layout( title = 'Price Boxplot', yaxis = list(title = 'Price'))
p
Using this plotly boxplot we notice that we have 25% of the prices between 6.477451 and 11.807022. 50% between 6.477451 and 14.384413 and 75% between 6.477451 and 18.902429. The remaning 25‰ are data located in a wide range of prices between 18.902429 and 46.610372
Price dependency on length
Let’s study now the price dependency on lenght.
p <- ggplot(data=dataset, aes(x= Length, y= Price)) + geom_point(size=1) + geom_smooth(method=lm) + ggtitle(" Price versus lenght ")
ggplotly(p)
fit_price_lenght <- lm(Price~Length, data=dataset)
summary(fit_price_lenght)
Call:
lm(formula = Price ~ Length, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-13.901 -2.854 -0.741 1.931 16.181
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.94613 0.50918 17.57 <2e-16 ***
Length 0.73168 0.03953 18.51 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.308 on 193 degrees of freedom
Multiple R-squared: 0.6397, Adjusted R-squared: 0.6378
F-statistic: 342.6 on 1 and 193 DF, p-value: < 2.2e-16
We can observe a dependence. 63.9 % of the variability of the price is explained by the lenght.
Now we study the price dependency on weight.
p <- ggplot(data=dataset, aes(x= weight, y= Price)) + geom_point(size=1) + geom_smooth(method=lm) + ggtitle(" Price versus weight ")
ggplotly(p)
fit_price_weight <- lm(Price~weight, data=dataset)
summary(fit_price_weight)
Call:
lm(formula = Price ~ weight, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-14.7993 -2.6207 -0.6631 2.5396 13.8357
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.2275 0.5602 14.69 <2e-16 ***
weight 4.8312 0.2718 17.78 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.419 on 193 degrees of freedom
Multiple R-squared: 0.6208, Adjusted R-squared: 0.6189
F-statistic: 316 on 1 and 193 DF, p-value: < 2.2e-16
We can also observe a dependence. 62 % of the variability of the price is explained by the weight.
Now we will discuss the price dependency on some categorical features such as Impermeability, Shape and Supplier.
p <- plot_ly(type = 'box') %>% add_boxplot(y = dataset$Price , x = dataset$Impermeability, jitter = 0.3, pointpos = -1.8, boxpoints = 'all',
marker = list(color = 'rgb(7,40,89)'),
line = list(color = 'rgb(7,40,89)'),
name = "Price box") %>% layout( title = 'Price versus Impermeability Boxplot', yaxis = list(title = 'Price'))
p
Can't display both discrete & non-discrete data on same axisCan't display both discrete & non-discrete data on same axis
fit_price_impermeability <- lm(Price~ Impermeability, data=dataset)
summary(fit_price_impermeability)
Call:
lm(formula = Price ~ Impermeability, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-16.4106 -3.0187 -0.6286 2.4897 25.0638
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.7236 0.4117 35.77 <2e-16 ***
ImpermeabilityType 2 14.5846 1.1986 12.17 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.399 on 193 degrees of freedom
Multiple R-squared: 0.4341, Adjusted R-squared: 0.4312
F-statistic: 148 on 1 and 193 DF, p-value: < 2.2e-16
The boxplot show us that each impermeability type gather a wide range of prices :
However, we notice a dependence. 43 % of the variability of the price is explained by the impermeability type. Plus, we observe that the price range is statistically different for Type 1 and Type 2. Type 1 is statistically cheaper than Type 2 :
Concerning the price dependency on Shape :
p <- plot_ly(type = 'box') %>% add_boxplot(y = dataset$Price , x = dataset$Shape, jitter = 0.3, pointpos = -1.8, boxpoints = 'all',
marker = list(color = 'rgb(7,40,89)'),
line = list(color = 'rgb(7,40,89)'),
name = "Price box") %>% layout( title = 'Price versus Shape Boxplot', yaxis = list(title = 'Price'))
p
Can't display both discrete & non-discrete data on same axisCan't display both discrete & non-discrete data on same axis
fit_price_shape <- lm(Price~ Shape, data=dataset)
summary(fit_price_shape)
Call:
lm(formula = Price ~ Shape, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-11.098 -3.850 -1.025 3.055 25.587
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.2006 0.5406 26.267 < 2e-16 ***
ShapeShape 2 8.1403 1.0782 7.550 1.75e-12 ***
ShapeShape 3 1.4510 2.2777 0.637 0.52485
ShapeShape 4 7.4393 2.2777 3.266 0.00129 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.258 on 191 degrees of freedom
Multiple R-squared: 0.2475, Adjusted R-squared: 0.2357
F-statistic: 20.94 on 3 and 191 DF, p-value: 9.008e-12
In this case, it’s hard to notice a dependence between price and shape. Only 24 % of the variability of the price is explained by the Shape type. However, there is some insights :
Concerning the price dependency on Suppliers :
p <- plot_ly(type = 'box') %>% add_boxplot(y = dataset$Price , x = dataset$Supplier, jitter = 0.3, pointpos = -1.8, boxpoints = 'all',
marker = list(color = 'rgb(7,40,89)'),
line = list(color = 'rgb(7,40,89)'),
name = "Price box") %>% layout( title = 'Price versus Impermeability Boxplot', yaxis = list(title = 'Price'))
p
Can't display both discrete & non-discrete data on same axisCan't display both discrete & non-discrete data on same axis
fit_price_supplier <- lm(Price~ Supplier, data=dataset)
summary(fit_price_supplier)
Call:
lm(formula = Price ~ Supplier, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-11.431 -4.491 -1.847 2.873 30.349
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.029 1.285 14.033 <2e-16 ***
SupplierSupplier B -1.768 1.411 -1.252 0.212
SupplierSupplier C -3.140 2.303 -1.363 0.174
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.153 on 192 degrees of freedom
Multiple R-squared: 0.01174, Adjusted R-squared: 0.001449
F-statistic: 1.141 on 2 and 192 DF, p-value: 0.3217
There is no dependency on the price. Let’s study the prices in more details :
PriceComp_avg <- dataset %>% select(Supplier,Price) %>% group_by(Supplier) %>% summarise(Average_Price = mean(Price))
head(PriceComp_avg)
PriceComp_min <- dataset %>% select(Supplier,Price) %>% group_by(Supplier) %>% summarise(Minimum_Price = min(Price))
head(PriceComp_min)
PriceComp_avg_price_per_weight <- PriceComp_min <- dataset %>% select(Supplier,Price,weight) %>% group_by(Supplier) %>% summarise(Price_per_weight = mean(Price)/mean(weight))
head(PriceComp_avg_price_per_weight)
In terms of average price, the supplier C is the less expensive. In terms of absolute price, the supplier B is the less expensive. However, Supplier B is also the supplier which has the highest absolute price. *In terms of average price / weight Supplier A has is the less expensive.
One important point in exploratory data analysis consists in identifying potential outliers. Let’s identify this outliers given different features. For Mature.Volume variable :
d <- density(dataset$Mature.Volume)
hist(dataset$Mature.Volume, breaks=40, probability = TRUE, main = "Mature Volume distribution",
xlab = "Mature Volume")
lines(d, col = "red")
We can clearly notice here an outlier. We can now remove it
dataset <- dataset %>% filter ( Mature.Volume < 600000 )
Let’s verify the data now :
d <- density(dataset$Mature.Volume)
hist(dataset$Mature.Volume, breaks=40, probability = TRUE,main = "Mature Volume distribution",
xlab = "Mature Volume")
lines(d, col = "red")
After studying the other features distribution, we notive that there is no other outliers. Plus, it seems that every numerical feature have the same trend structure. Please find below the other distributions.
d <- density(dataset$Diameter)
hist(dataset$Diameter, breaks=40, probability = TRUE, main = "Diameter distribution",
xlab = "Mature Volume")
lines(d, col = "red")
d <- density(dataset$weight)
hist(dataset$weight, breaks=40, probability = TRUE, main = "Mature Volume distribution",
xlab = "Distribution")
lines(d, col = "red")
d <- density(dataset$nb.of.pieces)
hist(dataset$nb.of.pieces, breaks=40, probability = TRUE, main = "Nb of pieces distribution",
xlab = "Nb of pieces")
lines(d, col = "red")
d <- density(dataset$Length)
hist(dataset$Length, breaks=40, probability = TRUE, main = "Lenght distribution",
xlab = "Lenght")
lines(d, col = "red")
Now we will perform a PCA on the data. A PCA will allows us to fin a low-dimensional reprensation of the data that captures the “essense” of the raw data. Plus, a PCA will allows us denoise the data. This preprocessing and data exploration helps us to better understand/visualise the relations between the differents features and observations and to prepare the data for the prediction process. PCA deals with continuous variables but categorical variables are the projection of the categories at the barycentre of the observations which take the categories.
As we want to predict the price and we have Supplier, Shape, Impermeability and Finishing as qualitative variables, we will consider this last ones as illustrative. Let’s now process the PCA :
res.pca <- PCA(dataset, quali.sup=c(1,5,6,7,9), quanti.sup = 10, scale = TRUE)
summary(res.pca, nbelements = 10)
Call:
PCA(X = dataset, scale.unit = TRUE, quanti.sup = 10, quali.sup = c(1,
5, 6, 7, 9))
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
Variance 3.107 1.067 0.777 0.049 0.000
% of var. 62.142 21.338 15.537 0.976 0.006
Cumulative % of var. 62.142 83.481 99.018 99.994 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
1 | 4.259 | 4.026 2.731 0.894 | -1.247 0.763 0.086 | -0.008 0.000 0.000 |
2 | 4.740 | 4.674 3.681 0.972 | -0.633 0.196 0.018 | 0.466 0.146 0.010 |
3 | 4.739 | 4.662 3.662 0.968 | -0.671 0.221 0.020 | 0.518 0.181 0.012 |
4 | 0.966 | 0.050 0.000 0.003 | 0.451 0.100 0.218 | -0.796 0.427 0.680 |
5 | 1.644 | -0.771 0.100 0.220 | -0.314 0.048 0.036 | 1.408 1.337 0.734 |
6 | 0.802 | -0.507 0.043 0.399 | 0.577 0.163 0.518 | 0.178 0.021 0.049 |
7 | 1.123 | 0.253 0.011 0.051 | -0.185 0.017 0.027 | -1.070 0.772 0.908 |
8 | 1.145 | 0.605 0.062 0.279 | 0.959 0.452 0.702 | -0.154 0.016 0.018 |
9 | 1.153 | 0.622 0.065 0.291 | 0.959 0.451 0.692 | -0.149 0.015 0.017 |
10 | 1.165 | 0.647 0.071 0.309 | 0.958 0.450 0.676 | -0.142 0.014 0.015 |
Variables
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
Diameter | 0.985 31.233 0.970 | -0.025 0.061 0.001 | 0.144 2.674 0.021 |
weight | 0.977 30.750 0.955 | -0.029 0.077 0.001 | 0.103 1.371 0.011 |
nb.of.pieces | -0.202 1.309 0.041 | 0.843 66.557 0.710 | 0.499 32.076 0.249 |
Mature.Volume | -0.412 5.458 0.170 | -0.596 33.258 0.355 | 0.690 61.213 0.476 |
Length | 0.985 31.250 0.971 | -0.023 0.048 0.001 | 0.144 2.666 0.021 |
Supplementary continuous variable
Dim.1 cos2 Dim.2 cos2 Dim.3 cos2
Price | 0.796 0.634 | 0.171 0.029 | 0.131 0.017 |
Supplementary categories (the 10 first)
Dist Dim.1 cos2 v.test Dim.2 cos2 v.test Dim.3 cos2 v.test
Supplier A | 0.591 | 0.548 0.859 1.813 | -0.055 0.009 -0.308 | -0.214 0.131 -1.416 |
Supplier B | 0.144 | -0.065 0.206 -0.949 | -0.126 0.759 -3.109 | -0.027 0.035 -0.782 |
Supplier C | 1.675 | -0.444 0.070 -0.976 | 1.441 0.739 5.407 | 0.728 0.189 3.203 |
Shape 1 | 0.496 | -0.426 0.736 -4.859 | -0.138 0.077 -2.687 | -0.214 0.186 -4.888 |
Shape 2 | 1.530 | 1.427 0.870 6.196 | 0.394 0.066 2.921 | 0.383 0.063 3.325 |
Shape 3 | 0.668 | -0.560 0.703 -0.915 | -0.332 0.248 -0.927 | 0.060 0.008 0.195 |
Shape 4 | 1.427 | -0.552 0.150 -0.902 | 0.356 0.062 0.992 | 1.265 0.786 4.138 |
Type 1 | 0.450 | -0.450 1.000 -9.517 | -0.002 0.000 -0.066 | -0.009 0.000 -0.389 |
Type 2 | 3.290 | 3.289 1.000 9.517 | 0.013 0.000 0.066 | 0.067 0.000 0.389 |
Hot Printing | 0.354 | -0.286 0.653 -1.551 | -0.038 0.011 -0.349 | 0.192 0.295 2.083 |
fviz_pca_ind(res.pca)
fviz_pca_ind(res.pca, col.ind="cos2", label=c("quali"), geom = "point") + scale_color_gradient2(low="lightblue", mid="blue", high="darkblue", midpoint=0.6)+ theme_minimal()
Before commenting this graphs, let’s compute also the correlation matrix :
X <- scale(as.matrix(dataset %>% select(-c(1,5,6,7,9,10))))
as.data.frame(cov(X))
NA
The variable factor map shows us that :
This correlations are well explained in the covariance matrix. The cells that correspond to a combinaison of highly correlated features have a cov higher than 0.9 and the cells that correspond to a combinaison of uncorrelated (orthogonal) features have a cov lower than 0.2 - 0.3.
The PCA focuses on the relationships between the continuous variables. In fact, the PCA compute the vectors which are the synthetic variables the most correlated to all the continuous variables. Then, its possible to study the projection of the observations/features on this vectors and discuss the link between them. The issue is that the PCA does not handle categorical variables to the computation of the synthetic vectors.
Let’s now focus on the individual factor map : The barycentre related to \(Impermeability = Type2\) and \(Raw.Material = PS\) are near to the first synthetic axe which means that this two categories Type2 and PS are highly correlated to this axe and then, given the previous analysis to Lenght, weight, price and diameter. In fact, for instance, we have seen that Type 2 have a higher price in average than Type 1. We can say also for example than a PS product is related to a high diameter.
plot(res.pca$eig[,3], type="l", ylab = "Cumulative percentage of inertia", xlab = "Nb of synthetic vectors")
The R object with the two principal components which are the synthetic variables the most correlated to all the variables is the two eigen vectors of the PCA linked to the two highest eigenvalues.
as.data.frame(res.pca$var$coord[,1:2])
PCA is often used as a pre-processing step before applying a clustering algorithm. In fact, we often perform the CAH or the k-means on the \(k\) principal components to denoise the data. In this setting \(k\) is choosen as large since we do not want to loose any information, but want to discard the last components that can be considered as noise. Consequently, we keep the number of dimensions \(k\) such that we reach 95% of the inertia in PCA. In our case we have \(k=3\) (cf last graph)
Let’s now perform a kmeans algorithm on the selected k principal components of PCA.
# We keep the 3 first components of the PCA
dat <- res.pca$ind$coord[,1:3]
#Performing the clustering
clus <- kmeans(dat, 3, nstart = 20)
#Visualizing the clusters
plot(dat, col = clus$cluster, pch = 19, frame = FALSE, main = "K-means with k = 3")
points(clus$centers, col = 1:4, pch = 8, cex = 3)
# Visualizing the
fviz_nbclust(dat, kmeans, method = "wss") + geom_vline(xintercept = 3, linetype = 2)
Using “methode du coude” we find that the optimal number of cluster is 3.
#Performing a PCA on the 3 principal compenents
res.pca3 <- PCA(dataset, quali.sup=c(1,5,6,7,9), quanti.sup = 10, scale = TRUE, ncp = 3)
#res.pca2 <- PCA(dataset, quali.sup=c(1,5,6,7,9), quanti.sup = 10, scale = TRUE, ncp = 2)
#res.pca4 <- PCA(dataset, quali.sup=c(1,5,6,7,9), quanti.sup = 10, scale = TRUE, ncp = 4)
# Performing the AHC on the 3 principal components of the PCA
res.hcpc3 <- HCPC(res.pca3, nb.clust = -1)
#res.hcpc2 <- HCPC(res.pca2, nb.clust = -1)
#res.hcpc4 <- HCPC(res.pca4, nb.clust = -1)
#plot(res.hcpc2$call$t$within[1:14])
plot(res.hcpc3$call$t$within[1:14])
#plot(res.hcpc4$call$t$within[1:14])
The cluster 1 is made of individuals sharing : - high values for the variable Mature.Volume. - low values for the variables nb.of.pieces, Price, weight, Length and Diameter (variables are sorted from the weakest).
The cluster 2 is made of individuals sharing : - high values for the variable nb.of.pieces. - low values for the variables Mature.Volume, Diameter, Length, weight and Price (variables are sorted from the weakest).
The cluster 3 is made of individuals such as 89, 90, 131, 161, 163 and 164. This group is characterized by : - high values for the variables Length, Diameter, weight and Price (variables are sorted from the strongest). - low values for the variables nb.of.pieces and Mature.Volume (variables are sorted from the weakest).
res.hcpc2$desc.var$quanti.var
Eta2 P-value
Length 0.8077126 4.913463e-68
Diameter 0.8068180 7.600787e-68
weight 0.8049086 1.915936e-67
Mature.Volume 0.5465422 5.181644e-33
Price 0.5051780 1.897610e-29
nb.of.pieces 0.3011815 2.345580e-15
res.hcpc3$desc.var$quanti.var
Eta2 P-value
Length 0.8036361 3.530117e-67
Diameter 0.8025769 5.853470e-67
weight 0.8013378 1.053993e-66
Mature.Volume 0.7588382 8.649758e-59
Price 0.4812030 1.620918e-27
nb.of.pieces 0.1760507 1.243497e-08
res.hcpc4$desc.var$quanti.var
Eta2 P-value
Length 0.8036361 3.530117e-67
Diameter 0.8025769 5.853470e-67
weight 0.8013378 1.053993e-66
Mature.Volume 0.7588382 8.649758e-59
Price 0.4812030 1.620918e-27
nb.of.pieces 0.1760507 1.243497e-08
res.hcpc3$desc.var$test.chi2
p.value df
Impermeability 5.318642e-18 2
Raw.Material 5.226547e-17 4
Shape 5.626207e-06 6
Supplier 4.102258e-02 4
res.hcpc2$desc.var$category$`1`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PP 26.388889 95.0 75.392670 4.760733e-04 3.493870
Impermeability=Type 1 23.809524 100.0 87.958115 3.056564e-03 2.961991
Raw.Material=ABS 4.761905 2.5 10.994764 4.405004e-02 -2.013614
Supplier=Supplier C 0.000000 0.0 7.329843 3.260506e-02 -2.136913
Raw.Material=PS 3.846154 2.5 13.612565 1.378402e-02 -2.462843
Impermeability=Type 2 0.000000 0.0 12.041885 3.056564e-03 -2.961991
Shape=Shape 2 2.222222 2.5 23.560209 9.002617e-05 -3.916011
res.hcpc3$desc.var$category$`1`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PP 25.000000 97.297297 75.392670 0.0001368886 3.813724
Impermeability=Type 1 22.023810 100.000000 87.958115 0.0049829303 2.808135
Supplier=Supplier C 0.000000 0.000000 7.329843 0.0434829294 -2.019041
Raw.Material=PS 3.846154 2.702703 13.612565 0.0222292828 -2.286427
Raw.Material=ABS 0.000000 0.000000 10.994764 0.0081544536 -2.645607
Impermeability=Type 2 0.000000 0.000000 12.041885 0.0049829303 -2.808135
Shape=Shape 2 2.222222 2.702703 23.560209 0.0002330416 -3.680210
res.hcpc4$desc.var$category$`1`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PP 25.000000 97.297297 75.392670 0.0001368886 3.813724
Impermeability=Type 1 22.023810 100.000000 87.958115 0.0049829303 2.808135
Supplier=Supplier C 0.000000 0.000000 7.329843 0.0434829294 -2.019041
Raw.Material=PS 3.846154 2.702703 13.612565 0.0222292828 -2.286427
Raw.Material=ABS 0.000000 0.000000 10.994764 0.0081544536 -2.645607
Impermeability=Type 2 0.000000 0.000000 12.041885 0.0049829303 -2.808135
Shape=Shape 2 2.222222 2.702703 23.560209 0.0002330416 -3.680210
res.hcpc2$desc.var$category$`2`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 1 72.61905 93.846154 87.958115 0.0005815138 3.440093
Supplier=Supplier C 100.00000 10.769231 7.329843 0.0036107554 2.910306
Raw.Material=PP 72.91667 80.769231 75.392670 0.0146197613 2.441664
Raw.Material=PS 38.46154 7.692308 13.612565 0.0009776513 -3.296880
Impermeability=Type 2 34.78261 6.153846 12.041885 0.0005815138 -3.440093
res.hcpc3$desc.var$category$`2`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 1 74.40476 93.984962 87.958115 0.0002868332 3.626910
Supplier=Supplier C 100.00000 10.526316 7.329843 0.0050545222 2.803538
Raw.Material=PP 74.30556 80.451128 75.392670 0.0173789787 2.378590
Raw.Material=PS 38.46154 7.518797 13.612565 0.0004703729 -3.497084
Impermeability=Type 2 34.78261 6.015038 12.041885 0.0002868332 -3.626910
res.hcpc4$desc.var$category$`2`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 1 74.40476 93.984962 87.958115 0.0002868332 3.626910
Supplier=Supplier C 100.00000 10.526316 7.329843 0.0050545222 2.803538
Raw.Material=PP 74.30556 80.451128 75.392670 0.0173789787 2.378590
Raw.Material=PS 38.46154 7.518797 13.612565 0.0004703729 -3.497084
Impermeability=Type 2 34.78261 6.015038 12.041885 0.0002868332 -3.626910
res.hcpc3$desc.var
$test.chi2
p.value df
Impermeability 5.318642e-18 2
Raw.Material 5.226547e-17 4
Shape 5.626207e-06 6
Supplier 4.102258e-02 4
$category
$category$`1`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PP 25.000000 97.297297 75.392670 0.0001368886 3.813724
Impermeability=Type 1 22.023810 100.000000 87.958115 0.0049829303 2.808135
Supplier=Supplier C 0.000000 0.000000 7.329843 0.0434829294 -2.019041
Raw.Material=PS 3.846154 2.702703 13.612565 0.0222292828 -2.286427
Raw.Material=ABS 0.000000 0.000000 10.994764 0.0081544536 -2.645607
Impermeability=Type 2 0.000000 0.000000 12.041885 0.0049829303 -2.808135
Shape=Shape 2 2.222222 2.702703 23.560209 0.0002330416 -3.680210
$category$`2`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 1 74.40476 93.984962 87.958115 0.0002868332 3.626910
Supplier=Supplier C 100.00000 10.526316 7.329843 0.0050545222 2.803538
Raw.Material=PP 74.30556 80.451128 75.392670 0.0173789787 2.378590
Raw.Material=PS 38.46154 7.518797 13.612565 0.0004703729 -3.497084
Impermeability=Type 2 34.78261 6.015038 12.041885 0.0002868332 -3.626910
$category$`3`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 2 65.2173913 71.428571 12.04188 2.909966e-12 6.982005
Raw.Material=PS 57.6923077 71.428571 13.61257 4.169068e-11 6.597941
Shape=Shape 2 31.1111111 66.666667 23.56021 9.932485e-06 4.418638
Shape=Shape 1 5.3846154 33.333333 68.06283 6.715709e-04 -3.400930
Impermeability=Type 1 3.5714286 28.571429 87.95812 2.909966e-12 -6.982005
Raw.Material=PP 0.6944444 4.761905 75.39267 2.869539e-13 -7.300381
$quanti.var
Eta2 P-value
Length 0.8036361 3.530117e-67
Diameter 0.8025769 5.853470e-67
weight 0.8013378 1.053993e-66
Mature.Volume 0.7588382 8.649758e-59
Price 0.4812030 1.620918e-27
nb.of.pieces 0.1760507 1.243497e-08
$quanti
$quanti$`1`
v.test Mean in category Overall mean sd in category Overall sd p.value
Mature.Volume 11.942982 2.431183e+05 82206.026178 67166.762125 9.103190e+04 7.064414e-33
Diameter -3.255425 8.214269e-01 1.294639 0.254233 9.821218e-01 1.132228e-03
Length -3.297003 6.491733e+00 10.329589 2.056760 7.864783e+00 9.772253e-04
weight -3.536244 1.100262e+00 1.714121 0.315574 1.172854e+00 4.058595e-04
nb.of.pieces -3.780986 3.324324e+00 4.115183 1.274576 1.413225e+00 1.562083e-04
Price -3.857939 1.245686e+01 16.552332 4.115901 7.172431e+00 1.143473e-04
$quanti$`2`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces 5.739229 4.503759 4.115183 1.352381e+00 1.413225e+00 9.510856e-09
Price -2.999298 15.521715 16.552332 4.620374e+00 7.172431e+00 2.706026e-03
weight -5.297059 1.416482 1.714121 3.882302e-01 1.172854e+00 1.176825e-07
Length -5.533133 8.244766 10.329589 2.492726e+00 7.864783e+00 3.145605e-08
Diameter -5.565997 1.032748 1.294639 3.121379e-01 9.821218e-01 2.606582e-08
Mature.Volume -8.031930 47177.255639 82206.026178 3.971314e+04 9.103190e+04 9.595124e-16
$quanti$`3`
v.test Mean in category Overall mean sd in category Overall sd p.value
Length 12.298789 30.295407 10.329589 7.979008e+00 7.864783e+00 9.194180e-35
Diameter 12.294570 3.787033 1.294639 1.000521e+00 9.821218e-01 9.687099e-35
weight 12.254018 4.680731 1.714121 1.164251e+00 1.172854e+00 1.598746e-34
Price 9.282815 30.295414 16.552332 8.814239e+00 7.172431e+00 1.650583e-20
Mature.Volume -3.281665 20542.857143 82206.026178 1.547128e+04 9.103190e+04 1.031962e-03
nb.of.pieces -3.659694 3.047619 4.115183 7.221786e-01 1.413225e+00 2.525166e-04
attr(,"class")
[1] "catdes" "list "
res.hcpc2$desc.var$quanti$`2`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces 7.560366 4.646154 4.115183 1.317699e+00 1.413225e+00 4.019369e-14
Price -2.074083 15.813052 16.552332 4.466612e+00 7.172431e+00 3.807163e-02
weight -4.858401 1.430947 1.714121 3.790811e-01 1.172854e+00 1.183376e-06
Length -5.035751 8.361397 10.329589 2.475781e+00 7.864783e+00 4.759789e-07
Diameter -5.052860 1.048024 1.294639 3.099323e-01 9.821218e-01 4.352422e-07
Mature.Volume -6.596988 52362.115385 82206.026178 5.350229e+04 9.103190e+04 4.195953e-11
res.hcpc3$desc.var$quanti$`2`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces 5.739229 4.503759 4.115183 1.352381e+00 1.413225e+00 9.510856e-09
Price -2.999298 15.521715 16.552332 4.620374e+00 7.172431e+00 2.706026e-03
weight -5.297059 1.416482 1.714121 3.882302e-01 1.172854e+00 1.176825e-07
Length -5.533133 8.244766 10.329589 2.492726e+00 7.864783e+00 3.145605e-08
Diameter -5.565997 1.032748 1.294639 3.121379e-01 9.821218e-01 2.606582e-08
Mature.Volume -8.031930 47177.255639 82206.026178 3.971314e+04 9.103190e+04 9.595124e-16
res.hcpc4$desc.var$quanti$`2`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces 5.739229 4.503759 4.115183 1.352381e+00 1.413225e+00 9.510856e-09
Price -2.999298 15.521715 16.552332 4.620374e+00 7.172431e+00 2.706026e-03
weight -5.297059 1.416482 1.714121 3.882302e-01 1.172854e+00 1.176825e-07
Length -5.533133 8.244766 10.329589 2.492726e+00 7.864783e+00 3.145605e-08
Diameter -5.565997 1.032748 1.294639 3.121379e-01 9.821218e-01 2.606582e-08
Mature.Volume -8.031930 47177.255639 82206.026178 3.971314e+04 9.103190e+04 9.595124e-16
catdes(dataset, num.var= 7)
Chi-squared approximation may be incorrectChi-squared approximation may be incorrect
$test.chi2
p.value df
Shape 0.01040072 3
Raw.Material 0.03255977 2
$category
$category$`Hot Printing`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PP 37.50000 87.096774 75.392670 0.00809671 2.648010
Shape=Shape 4 75.00000 9.677419 4.188482 0.01693360 2.388146
Shape=Shape 1 27.69231 58.064516 68.062827 0.04420603 -2.012132
Raw.Material=PS 15.38462 6.451613 13.612565 0.04263939 -2.027225
$category$Lacquering
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PS 84.61538 17.054264 13.612565 0.04263939 2.027225
Shape=Shape 1 72.30769 72.868217 68.062827 0.04420603 2.012132
Shape=Shape 4 25.00000 1.550388 4.188482 0.01693360 -2.388146
Raw.Material=PP 62.50000 69.767442 75.392670 0.00809671 -2.648010
$quanti.var
Eta2 P-value
Mature.Volume 0.02787439 0.02097347
$quanti
$quanti$`Hot Printing`
v.test Mean in category Overall mean sd in category Overall sd p.value
Mature.Volume 2.301333 104128.8 82206.03 105905.8 91031.9 0.02137281
$quanti$Lacquering
v.test Mean in category Overall mean sd in category Overall sd p.value
Mature.Volume -2.301333 71669.5 82206.03 80851.43 91031.9 0.02137281
attr(,"class")
[1] "catdes" "list "
res.famd = FAMD(dataset, ncp = 5, sup.var = c(10, 7,1,5) )
res.famd$eig
eigenvalue percentage of variance cumulative percentage of variance
comp 1 4.2175168 52.718960 52.71896
comp 2 1.1946156 14.932695 67.65165
comp 3 1.0329612 12.912015 80.56367
comp 4 0.7585481 9.481851 90.04552
comp 5 0.4747152 5.933940 95.97946
res.hcpc <- HCPC(res.famd, nb.clust = -1, graph = FALSE)
Chi-squared approximation may be incorrectChi-squared approximation may be incorrectChi-squared approximation may be incorrectChi-squared approximation may be incorrectChi-squared approximation may be incorrect
plot.HCPC(res.hcpc, choice = "map", draw.tree = FALSE, select = "drawn", title = '')
res.famd
*The results are available in the following objects:
name description
1 "$eig" "eigenvalues and inertia"
2 "$var" "Results for the variables"
3 "$ind" "results for the individuals"
4 "$quali.var" "Results for the qualitative variables"
5 "$quanti.var" "Results for the quantitative variables"
centroids <- res.hcpc$call$X %>% group_by(clust) %>% summarise(Dim.1 = mean(Dim.1), Dim.2 = mean(Dim.2), Dim.3 = mean(Dim.3), Dim.4 = mean(Dim.4),
Dim.5 = mean(Dim.5))
centroids
P <- dataset[3,]
famd = FAMD(dataset, ncp = 5, sup.var = c(10,7,1,5))
COORD= as.matrix( predict.FAMD(famd, P)$coord)
C1 = as.matrix( as.data.frame(centroids[1,2:6]))
C2 = as.matrix( as.data.frame(centroids[2,2:6]))
C3 = as.matrix( as.data.frame(centroids[3,2:6]))
C4 = as.matrix( as.data.frame(centroids[4,2:6]))
C5 = as.matrix( as.data.frame(centroids[5,2:6]))
norm(COORD - C1 )
[1] 7.165657
norm(COORD - C2 )
[1] 6.581067
norm(COORD - C3 )
[1] 5.896929
norm(COORD - C4 )
[1] 3.201154
norm(COORD - C5 )
[1] 0.9706492
PRICE <- res.hcpc$data.clust %>% group_by(clust) %>% summarise(price_moy = mean(Price), price_sd = sd(Price))
PRICE
res.hcpc$data.clust
# Linear Regression
fit <- lm( Price ~ Diameter + weight + nb.of.pieces + Impermeability +
Mature.Volume + Raw.Material + Length + Supplier + Finishing + Shape, data= dataset)
P <- dataset[168,]
P2 <- P %>% select(-c(10))
P
predict(fit, P2)
168
14.56898
new_data_set <- res.hcpc$data.clust
View(new_data_set)
res_famd_new = FAMD(new_data_set, ncp = 6, sup.var = c(10,7,1,5) )
new_data_set_bis <- as.data.frame(res_famd_new$ind$coord)
new_data_set_bis$Price <- dataset$Price
new_data_set_bis <- new_data_set_bis %>% rename(Dim1 = Dim.1, Dim2 = Dim.2,Dim3 = Dim.3,Dim4 = Dim.4, Dim5 = Dim.5, Dim6 = Dim.6)
P <- new_data_set[84:90,]
predict_price <- function(v) {
fit_final <- lm(Price ~ Dim1 + Dim2 + Dim3 + Dim4 + Dim5 + Dim6 , data= new_data_set_bis)
P_bis <- as.data.frame(predict.FAMD(res_famd_new, v)$coord)
P_bis <- P_bis %>% rename(Dim1 = 'Dim 1', Dim2 = 'Dim 2',Dim3 = 'Dim 3',Dim4 = 'Dim 4', Dim5 = 'Dim 5', Dim6 = 'Dim 6')
return(predict(fit_final, P_bis))
}
predict_price(P)
841 851 861 871 881 891 901
41.44650 41.69060 16.88254 20.21065 11.37100 11.61257 17.48260
learning_data_set <- rbind(dataset[1:23,], dataset[30:117,],dataset[178:188,])
testing_data_set <- rbind(dataset[24:39,], dataset[118:177,],dataset[189:191,])
res_famd_learning = FAMD( learning_data_set , ncp = 5, sup.var = c(10, 7,1,5) )
res_hcpc_learning <- HCPC(res.famd, nb.clust = -1, graph = FALSE)
Chi-squared approximation may be incorrectChi-squared approximation may be incorrectChi-squared approximation may be incorrectChi-squared approximation may be incorrectChi-squared approximation may be incorrect
new_data_set_learning <- res_hcpc_learning$data.clust
res_famd_new_learning = FAMD(new_data_set_learning, ncp = 6, sup.var = c(10,7,1,5) )
new_data_set_bis_learning <- as.data.frame(res_famd_new$ind$coord)
new_data_set_bis_learning$Price <- dataset$Price
new_data_set_bis_learning <- new_data_set_bis_learning %>% rename(Dim1 = Dim.1, Dim2 = Dim.2,Dim3 = Dim.3,Dim4 = Dim.4, Dim5 = Dim.5, Dim6 = Dim.6)
predict_price_final <- function(v) {
fit_final <- lm(Price ~ Dim1 + Dim2 + Dim3 + Dim4 + Dim5 + Dim6 , data= new_data_set_bis_learning)
centroids <- res_hcpc_learning$call$X %>% group_by(clust) %>% summarise(Dim.1 = mean(Dim.1), Dim.2 = mean(Dim.2), Dim.3 = mean(Dim.3), Dim.4 = mean(Dim.4),Dim.5 = mean(Dim.5))
COORD <- as.matrix( predict.FAMD(res_famd_learning, v)$coord)
C1 <- as.matrix( as.data.frame(centroids[1,2:6]))
C2 <- as.matrix( as.data.frame(centroids[2,2:6]))
C3 <- as.matrix( as.data.frame(centroids[3,2:6]))
C4 <- as.matrix( as.data.frame(centroids[4,2:6]))
C5 <- as.matrix( as.data.frame(centroids[5,2:6]))
k <- which.min(c(norm(COORD - C1 ),norm(COORD - C2 ),norm(COORD - C3 ),norm(COORD - C4 ),norm(COORD - C5 )))
v$clust <- k
v_bis <- as.data.frame(predict.FAMD(res_famd_new, v)$coord)
v_bis <- v_bis %>% rename(Dim1 = 'Dim 1', Dim2 = 'Dim 2',Dim3 = 'Dim 3',Dim4 = 'Dim 4', Dim5 = 'Dim 5', Dim6 = 'Dim 6')
return(predict(fit_final, v_bis))
}
P <- new_data_set[29,]
predict_price_final(P)
291
13.42327
predict_price_final <- function(v) {
fit_final <- lm(Price ~ Dim1 + Dim2 + Dim3 + Dim4 + Dim5 + Dim6 , data= new_data_set_bis_learning)
centroids <- res_hcpc_learning$call$X %>% group_by(clust) %>% summarise(Dim.1 = mean(Dim.1), Dim.2 = mean(Dim.2), Dim.3 = mean(Dim.3), Dim.4 = mean(Dim.4),Dim.5 = mean(Dim.5))
COORD <- as.matrix( predict.FAMD(res_famd_learning, v)$coord)
C1 <- as.matrix( as.data.frame(centroids[1,2:6]))
C2 <- as.matrix( as.data.frame(centroids[2,2:6]))
C3 <- as.matrix( as.data.frame(centroids[3,2:6]))
C4 <- as.matrix( as.data.frame(centroids[4,2:6]))
C5 <- as.matrix( as.data.frame(centroids[5,2:6]))
k <- which.min(c(norm(COORD - C1 ),norm(COORD - C2 ),norm(COORD - C3 ),norm(COORD - C4 ),norm(COORD - C5 )))
v$clust <- k
v_bis <- as.data.frame(predict.FAMD(res_famd_new, v)$coord)
v_bis <- v_bis %>% rename(Dim1 = 'Dim 1', Dim2 = 'Dim 2',Dim3 = 'Dim 3',Dim4 = 'Dim 4', Dim5 = 'Dim 5', Dim6 = 'Dim 6')
return(predict(fit_final, v_bis))
}
vect <- c()
for (i in 1:79) {
v <- testing_data_set[i,]
vect <- c(vect,predict_price_final(v))
}
M<- as.data.frame(cbind(vect,testing_data_set$Price, abs(testing_data_set$Price-vect)))
View(M)